Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workflow for rule validation #22

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

sed-i
Copy link
Contributor

@sed-i sed-i commented Jan 4, 2023

This PR adds a workflow to automatically validate all alert rules in the repo.

Breaking change

At the moment this would be a breaking change because cos-tool (or promtool) does not support our custom "one alert rule per file" format.

We could update cos-tool to automatically wrap such files with groups: before attempting validation or abandon the "one alert rule per file" format altogether. Or both, incrementally.

Fixes #12.

@sed-i
Copy link
Contributor Author

sed-i commented Jan 4, 2023

wdyt @simskij @rbarry82 ?

wget https://github.com/canonical/cos-tool/releases/download/rel-20220621/cos-tool-amd64
chmod +x ./cos-tool-amd64
- name: Validate prometheus alert rules
run: find . -path '*/prometheus_alert_rules/*' -type f -exec ./cos-tool-amd64 --format promql lint {} +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will pass all of the files in at once, which may yield... sort of unreadable results.

Maybe:

find . -path '*/prometheus_alert_rules/*' -type f -print0 | while read -d $'\0' file; do echo "Validating $file"; ./cos-tool-amd64 --format promql lint $file; done

Copy link
Contributor Author

@sed-i sed-i Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to \; in next commit.

This produces the following sample output:

$ find . -path '*/prometheus_alert_rules/*' -type f -exec ./cos-tool-amd64 --format promql lint {} + || echo failed 
error validating: yaml: unmarshal errors:
  line 1: field alert not found in type rulefmt.RuleGroups
  line 2: field expr not found in type rulefmt.RuleGroups
  line 3: field for not found in type rulefmt.RuleGroups
  line 4: field labels not found in type rulefmt.RuleGroups
  line 6: field annotations not found in type rulefmt.RuleGroups
failed

I think we should update cos-tool to also print the filename being inspected

Wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed this comment.

This is also due to the single-rule format. cos-tool could print the filename being inspected, but... it's also not gonna be that useful. Because the single-rule format files need to be coerced into a parsable format, and in addition, the Python representation of Loki's rules doesn't quite actually add a name to the groups key, we do it in the CosTool Python class before ever passing it over to cos-tool itself at all.

It could be done in Go, but the honest truth is that inspecting and transforming arbitrary YAML/JSON is not a fun experience. The modules/libraries are good enough about giving options for "map this JSON/YAML onto this struct, and it's ok to ignore some of these fields and leave them uninitialized in the struct maybe, or throw and error if you can't exactly match it", but cos-tool trying to guess what a single rule format alert should "become" versus a "normal" alert group file is pretty awful and Python is just flat-out a better tool for it. To do it in Go ends up being really terrible.

This action would be better off directly invoking the CosTool class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single-rule format files need to be coerced into a parsable format

Still, this seems like a cos-tool internal, after which it could report the original filename

Are you suggesting the CosTool class because %%juju_topology%% could appear in our loki rules?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "original filename" passed to cos-tool, generally, has no significance whatsoever. A temporary file is created from relation data and that's used. loki_push_api actually mangles even that data so it will parse appropriately.

Taking the single rule per file format, which is already handled in Python, and having cos-tool try to figure out what it is and what it's supposed to be, mangle it into that format (which will not be representative of what is actually in the file anyway), and report back a filename for a use case which is CI only while this is covered appropriately at runtime through a combination of tools is not a good use of engineering time.

%%juju_topology%% would be a problem also, and even that is not covered in the CosTool class, but in the AlertRules class before sending it to CosTool at all to keep separation of concerns.

To be clear, this is not an intuitive process in Go most of the time -- unmarshaling YAML/JSON does not give you a map[string]map[string]... with any kind of intelligence, it is a map[string]interface{} which then must be reflected or type asserted on. Imagine if, in Python, loading the rules looked like:

data: Dict[str, object]
data = yaml.safe_load(some_rule_data)

if "group" not in data.keys():
  try:
    data = ruleNode(data)
  except ValueError as e:
    # could not instantiate because the types are wrong
    return someerr
else:
   try:
     # try to type assert `Any` 
     if not isinstance(str, data["groups"].get("name", 0)):
       typing.cast(Dict[str, str], data["groups"]["name"]) = generate_some_name()
     if not isinstance(Dict[str, Any], typing.cast(List[Dict[str, Any]], data["rules"])[0]):
       return somerr
     # we are now more or less sure that there is a "rules" key with a list under it, a "names" key, and a top-level "groups" key
     validate_rules(typing.cast(ruleGroup, data)
   except ValueError as e:
     return someerr

The Go would be more verbose, but loading/unmarshaling arbitrary YAML/JSON into a walkable map without a huge amount of type assertions/casts, and/or blindly trying to map that data onto a struct and catching errors if it doesn't work is the "good" way to do this, idiomatically, in the standard library. It isn't what Go is good at, and an API which was designed for passing this to Go would have a field in the YAML which said "I am this type of object", or would be sent to an explicit endpoint, or in a different folder (and passed to a different func), or similar.

Obviously it can be done, but it's not a class of problems Go excels at, and Python does. Doing this is "good money after bad" when it's already capably handled in Python, and the use case for it is CI only, where we could just... do it in Python and let that invoke cos-tool as normal, even keeping track of the filename it passed through and outputting that is an all-around better design.

The job of cos-tool is to inject label matchers for topology and/or validate our transformed rules (after putting "single rule per file" rules into one group in relation data, substituting %%juju_topology%%, and so on) will work with the actual runtime code of the workload, not to be a catch-all for the thing we're doing to the data. Python does part that, and cos-tool ensures that either "this final ruleGroup will not cause errors in Prom/Loki" or "labels can be injected without trying to fiddle with regexps to partially re-implement query parsing logic from Go in Python". The inverse of this statement is also that re-implementing Juju relation data/application domain rules (single rule file, for example) is the job of charm code, not cos-tool.

@lucabello
Copy link
Contributor

What's the status on this? Is something we still want? @simskij @sed-i

@simskij simskij marked this pull request as draft January 12, 2024 10:53
@simskij
Copy link
Member

simskij commented Jan 12, 2024

Making this a draft until we've decided on how to proceed.

@sed-i
Copy link
Contributor Author

sed-i commented Oct 29, 2024

This came up again as part of canonical/grafana-agent-operator#199.

A usable, partial, solution, is to install the prom snap and:

promtool check rules src/prometheus_alert_rules/*

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add promtool check rules to o11y CI
4 participants